Background

In this script, we apply the cellcycleR method on the synchronized cell experiment in yeast in Botstein paper. The data was collected from Yeast Protein Database. The main motivation was to see if the sinusoidal like trends in Figure 1 in the paper can be replicated using the cellcycleR mechanism. Botstein’s group wrote another methods paper where they showed the two eigenegenes to correspond to a sine wave and a cos wave with same phase shift (Check Figs 1 and 3). These were the prior results from their studies that we wanted to validate using cellcycleR.

We here considered only the cdc15 and cdc28 cell lines.

library(qtlcharts)
library(CountClust)
library(parallel)
library(cellcycleR)
library(data.table)
library(binhf)
library(vioplot)
library(limma)

cdc15 Data analysis

Data exploration

The main concern in analyzing this data was the presence of NA values and how to deal with it. There were many genes which had NA values at different cell time points reported. To get around this problem, I substituted every non-leading NA value with the last non-NA value observed. In case of leading NAs, I replaced them by the first non-NA value. Since the cells were already ordered by time, this seemed the most obvious thing to do.

setwd("/Users/kushal/Documents/singleCell-method/project/analysis/")
data <- read.table("../data/Botstein_data/botstein_cdc.txt", sep="\t", header=TRUE, fill=TRUE);
cdc_data <- data[,grep("cdc15",colnames(data))];

fillNAgaps <- function(x, firstBack=FALSE) {
    ## NA's in a vector or factor are replaced with last non-NA values
    ## If firstBack is TRUE, it will fill in leading NA's with the first
    ## non-NA value. If FALSE, it will not change leading NA's.
    
    # If it's a factor, store the level labels and convert to integer
    lvls <- NULL
    if (is.factor(x)) {
        lvls <- levels(x)
        x    <- as.integer(x)
    }
 
    goodIdx <- !is.na(x)
 
    # These are the non-NA values from x only
    # Add a leading NA or take the first good value, depending on firstBack   
    if (firstBack)   goodVals <- c(x[goodIdx][1], x[goodIdx])
    else             goodVals <- c(NA,            x[goodIdx])

    # Fill the indices of the output vector with the indices pulled from
    # these offsets of goodVals. Add 1 to avoid indexing to zero.
    fillIdx <- cumsum(goodIdx)+1
    
    x <- goodVals[fillIdx]

    # If it was originally a factor, convert it back
    if (!is.null(lvls)) {
        x <- factor(x, levels=seq_along(lvls), labels=lvls)
    }

    x
}

cdc_data <- cdc_data[rowSums(is.na(cdc_data)) < (dim(cdc_data)[2] - 4),];
cdc_data_mod <- t(apply(cdc_data, 1, function(x) fillNAgaps(x, firstBack = TRUE)));

cycle_data <- t(cdc_data_mod);
dim(cycle_data)
## [1]   24 5799

We mean center the genes and scale them by the standard deviation.

cycle_data_norm <- apply(cycle_data,2,function(x)  return (x-mean(x))/sd(x))
celltime_levels <- 100;
cycle_data_norm <- cycle_data_norm[, -which(colSums(cycle_data_norm)==0)]

dim(cycle_data_norm)
## [1]   24 5733

cellcycleR application on Yeast cdc 15 data

out <- cell_ordering_class(cycle_data_norm, celltime_levels = 100, num_iter=100, save_path="../rdas/cell_order_yeast_cdc15.rda")

We ran the method above once already (took around 5 minutes) and now we just load the output.

out <- get(load(file="../rdas/cell_order_yeast_cdc15.rda"));
cell_order_full <- cell_ordering_full(out$signal_intensity, dim(cycle_data_norm)[2])

We look at the plots of the amplitudes, phases and the non signal variation of the genes.

amp_genes <- out$amp;
sd_genes <- out$sigma;
phi_genes <- out$phi;

plot(density(phi_genes), col="red", main="Density plot of the phases")

plot(density(amp_genes[-which.max(amp_genes)]), col="red", main="Density plot of the amplitudes")

plot(density(sd_genes[-which.max(sd_genes)]), col="red", main="Density plot of the non-signal sd")

We extract the genes with high SNR - these are more likely to be sinusoidal.

ESS <- amp_genes^2; RSS <- sd_genes^2

SNR <- ESS/RSS;

plot(SNR, col="red", pch=20, lwd=1)

top_genes <- which(SNR > 10);

Next we plot the qtlcharts for these top sinusoidal genes and see if their patterns are indeed sinusoidal or not.

First we present the data in the order of the columns (the cell times ordering reported by the authors).

iplotCurves(t(cycle_data_norm[,top_genes]))
## Set screen size to height=700 x width=1000

Now, the re-ordered cell times by cellcycleR (which do not match with the order of cell times given) are

new_cell_order <- shift(order(cell_order_full),2,dir="right")
iplotCurves(t(cycle_data_norm[new_cell_order,top_genes]))

cellcycleR order vs true (reported) order

temp <- shift(order(cell_order_full),2,dir="right")
print(colnames(cdc_data)[temp])
##  [1] "cdc15_290min" "cdc15_250min" "cdc15_130min" "cdc15_90min" 
##  [5] "cdc15_190min" "cdc15_210min" "cdc15_110min" "cdc15_70min" 
##  [9] "cdc15_150min" "cdc15_100min" "cdc15_200min" "cdc15_80min" 
## [13] "cdc15_120min" "cdc15_220min" "cdc15_240min" "cdc15_180min"
## [17] "cdc15_160min" "cdc15_140min" "cdc15_230min" "cdc15_170min"
## [21] "cdc15_30min"  "cdc15_50min"  "cdc15_270min" "cdc15_10min"

The cellcycleR order does not quite match with the true order but our order gives more sinusoidal patterns, so am not sure which to believe.

cdc 28 Data analysis

Data exploration

We repeat the same exploration as for cdc15 data.

data <- read.table("../data/Botstein_data/botstein_cdc.txt", sep="\t", header=TRUE, fill=TRUE);
cdc_data <- data[,grep("cdc28",colnames(data))];
cdc_data <- cdc_data[rowSums(is.na(cdc_data)) < (dim(cdc_data)[2] - 4),];
cdc_data_mod <- t(apply(cdc_data, 1, function(x) fillNAgaps(x, firstBack = TRUE)));

cycle_data <- t(cdc_data_mod);

cycle_data_norm <- apply(cycle_data,2,function(x)  return (x-mean(x))/sd(x))
celltime_levels <- 100;
cycle_data_norm <- cycle_data_norm[, -which(colSums(cycle_data_norm)==0)]

Applying cellcycleR on cdc28 data

out <- cell_ordering_class(cycle_data_norm, celltime_levels = 100, num_iter=100, save_path="../rdas/cell_order_yeast_cdc28.rda")

We ran the method above once already (took around 5 minutes) and now we just load the output.

out <- get(load(file="../rdas/cell_order_yeast_cdc28.rda"));
cell_order_full <- cell_ordering_full(out$signal_intensity, dim(cycle_data_norm)[2])

We look at the plots of the amplitudes, phases and the non signal variation of the genes.

amp_genes <- out$amp;
sd_genes <- out$sigma;
phi_genes <- out$phi;

plot(density(phi_genes), col="red", main="Density plot of the phases")

plot(density(amp_genes[-which.max(amp_genes)]), col="red", main="Density plot of the amplitudes")

plot(density(sd_genes[-which.max(sd_genes)]), col="red", main="Density plot of the non-signal sd")

We extract the genes with high SNR - these are more likely to be sinusoidal.

ESS <- amp_genes^2; RSS <- sd_genes^2

SNR <- ESS/RSS;

plot(SNR, col="red", pch=20, lwd=1)

top_genes <- which(SNR > 10);

Next we plot the qtlcharts for these top sinusoidal genes and see if their patterns are indeed sinusoidal or not.

First we present the data in the order of the columns (sorted by time)

iplotCurves(t(cycle_data_norm[,top_genes]))

Now, the re-ordered cell times (which do not match with the order of cell times given) are

new_cell_order <- shift(order(cell_order_full),-7,dir="right")
iplotCurves(t(cycle_data_norm[order(cell_order_full),top_genes]))

It seems that the two qtlcharts match up pretty well, also it was found that the relative ordering of the cells as reported by the authors matched pretty nicely with our ordering approach.

cellcycleR order vs true (reported) order

temp <- shift(order(cell_order_full),-7,dir="right")
print(colnames(cdc_data)[temp])
##  [1] "cdc28_0min"   "cdc28_10min"  "cdc28_20min"  "cdc28_30min" 
##  [5] "cdc28_40min"  "cdc28_50min"  "cdc28_60min"  "cdc28_70min" 
##  [9] "cdc28_80min"  "cdc28_120min" "cdc28_160min" "cdc28_150min"
## [13] "cdc28_140min" "cdc28_130min" "cdc28_90min"  "cdc28_100min"
## [17] "cdc28_110min"

The cellcycleR method does a pretty decent job at extracting the true order except towards the end of the cycle when the patterns are not so strong.

elu data

Data preparation

data <- read.table("../data/Botstein_data/botstein_cdc.txt", sep="\t", header=TRUE, fill=TRUE);
elu_data <- data[,grep("elu",colnames(data))];
elu_data <- elu_data[rowSums(is.na(elu_data)) < (dim(elu_data)[2] - 4),];
elu_data_mod <- t(apply(elu_data, 1, function(x) fillNAgaps(x, firstBack = TRUE)));

cycle_data <- t(elu_data_mod);

cycle_data_norm <- apply(cycle_data,2,function(x)  return (x-mean(x))/sd(x))
celltime_levels <- 100;
cycle_data_norm <- cycle_data_norm[, -which(colSums(cycle_data_norm)==0)]

Applying cellcycleR on elu data

out <- cell_ordering_class(cycle_data_norm, celltime_levels = 100, num_iter=100, save_path="../rdas/cell_order_yeast_elu.rda")

We ran the method above once already (took around 5 minutes) and now we just load the output.

out <- get(load(file="../rdas/cell_order_yeast_elu.rda"));
cell_order_full <- cell_ordering_full(out$signal_intensity, dim(cycle_data_norm)[2])

We look at the plots of the amplitudes, phases and the non signal variation of the genes.

amp_genes <- out$amp;
sd_genes <- out$sigma;
phi_genes <- out$phi;

plot(density(phi_genes), col="red", main="Density plot of the phases")

plot(density(amp_genes[-which.max(amp_genes)]), col="red", main="Density plot of the amplitudes")

plot(density(sd_genes[-which.max(sd_genes)]), col="red", main="Density plot of the non-signal sd")

We extract the genes with high SNR - these are more likely to be sinusoidal.

ESS <- amp_genes^2; RSS <- sd_genes^2

SNR <- ESS/RSS;

plot(SNR, col="red", pch=20, lwd=1)

top_genes <- which(SNR > 10);

Next we plot the qtlcharts for these top sinusoidal genes and see if their patterns are indeed sinusoidal or not.

First we present the data in the order of the columns (sorted by time)

iplotCurves(t(cycle_data_norm[,top_genes]))

Now, the re-ordered cell times (which do not match with the order of cell times given) are

new_cell_order <- shift(order(cell_order_full),6,dir="right")
iplotCurves(t(cycle_data_norm[new_cell_order,top_genes]))

cellcycleR order vs true (reported) order

temp <- shift(order(cell_order_full),6,dir="right")
print(colnames(elu_data)[temp])
##  [1] "elu.360min" "elu.330min" "elu.390min" "elu.240min" "elu.210min"
##  [6] "elu.270min" "elu.150min" "elu.120min" "elu.180min" "elu.90min" 
## [11] "elu.300min" "elu.60min"  "elu.30min"  "elu.0min"

For the elu data as well, the cellcycleR reordering seems to match pretty well with the original ordering provided by the authors. Also the trends seem to be pretty similar for the cellcycleR order of the cells and the order of the cells provided in the dataset.

alpha data

Data exploration

data <- read.table("../data/Botstein_data/botstein_cdc.txt", sep="\t", header=TRUE, fill=TRUE);
alpha_data <- data[,grep("alpha",colnames(data))];
alpha_data <- alpha_data[rowSums(is.na(alpha_data)) < (dim(alpha_data)[2] - 4),];
alpha_data_mod <- t(apply(alpha_data, 1, function(x) fillNAgaps(x, firstBack = TRUE)));

cycle_data <- t(alpha_data_mod);

cycle_data_norm <- apply(cycle_data,2,function(x)  return (x-mean(x))/sd(x))
celltime_levels <- 100;
cycle_data_norm <- cycle_data_norm[, -which(colSums(cycle_data_norm)==0)]

Applying cellcycleR on alpha data

out <- cell_ordering_class(cycle_data_norm, celltime_levels = 100, num_iter=100, save_path="../rdas/cell_order_yeast_alpha.rda")

We ran the method above once already (took around 5 minutes) and now we just load the output.

out <- get(load(file="../rdas/cell_order_yeast_alpha.rda"));
cell_order_full <- cell_ordering_full(out$signal_intensity, dim(cycle_data_norm)[2])

We look at the plots of the amplitudes, phases and the non signal variation of the genes.

amp_genes <- out$amp;
sd_genes <- out$sigma;
phi_genes <- out$phi;

plot(density(phi_genes), col="red", main="Density plot of the phases")

plot(density(amp_genes[-which.max(amp_genes)]), col="red", main="Density plot of the amplitudes")

plot(density(sd_genes[-which.max(sd_genes)]), col="red", main="Density plot of the non-signal sd")

We extract the genes with high SNR - these are more likely to be sinusoidal.

ESS <- amp_genes^2; RSS <- sd_genes^2

SNR <- ESS/RSS;

plot(SNR, col="red", pch=20, lwd=1)

top_genes <- which(SNR > 10);

Next we plot the qtlcharts for these top sinusoidal genes and see if their patterns are indeed sinusoidal or not.

First we present the data in the order of the columns (sorted by time)

iplotCurves(t(cycle_data_norm[,top_genes]))

Now, the re-ordered cell times (which do not match with the order of cell times given) are

new_cell_order <- shift(order(cell_order_full),6,dir="right")
iplotCurves(t(cycle_data_norm[new_cell_order,top_genes]))

cellcycleR order vs true (reported) order

temp <- shift(order(cell_order_full),6,dir="right")
print(colnames(alpha_data)[temp])
##  [1] "alpha.84min"  "alpha.77min"  "alpha.63min"  "alpha.91min" 
##  [5] "alpha.112min" "alpha.119min" "alpha.105min" "alpha.56min" 
##  [9] "alpha.98min"  "alpha.49min"  "alpha.70min"  "alpha.42min" 
## [13] "alpha.35min"  "alpha.28min"  "alpha.21min"  "alpha.14min" 
## [17] "alpha.7min"   "alpha.0min"

For this data as well, cellcycleR does fairly well and although not perfectly, but still gets close to figuring out the actual order of cell times.

Final Thoughts

The cellcycleR order seemed to conform pretty well with the actual order reported by the authors for the cdc28, elu and alpha cell lines data, but not so strongly for cdc15 line. Also, for all cases, we produced pretty sinusoidal patterns and also our patterns seem to correspond to the patterns they observed for some genes (I did not look very closely, but it may be worth doing so given that the analysis seems to be working well).